For this project the data from the “VII Encuesta de Presupuestos Familiares” (VII Household Budget Survey) was selected. This is a survey done every 2 to 3 years in Chile, that contains data about household’s inhabitants, some of their social characteristics as income, education, gender, age, etc., and the expenses each household have within a month. The data is splited in two data sets, one with the inhabitant description and income variables, with some groupping categories by households (households); the other with the expenses reported in several categories by household id (expenses).
Some data wrangling will be needed before starting some of the exploratory data analysis, given that the households data set have one entry by each household member, while the expenses data set contains the expenses only by household not separated by household inhabitant: this means that is possible to merge the data by matching the household ids, but not by individuals (because that was the intended use for the data).
The following libraries were used for this work:
ggplot2gridExtraGGallyggthemesdplyrtidyrknitrThe data is stored in RData files, after being transformed from SPSS data sets.
load("households.RData")
load("expenses.RData")
Some cleaning is still needed, for example there two negative ages and some households without a total income reported, because of missing data.
households <- subset(households, age > 0 & !is.na(income.hh.av.rent))
Let’s start with some simple explorations of the population in our data set. From the variables descriptions we decided to focus on the following variables:
What is the population’s age distribution? The summary function can give us a start:
summary(households[,'age'])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 17.0 32.0 34.9 51.0 103.0
The next figure shows an histogram using the age variable (a discrete numerical variable). The binwidths are equal to 1 year. The figure shows that the population is not normally distributed and positevely skewed overall. This is expected, since the population must decrese with age as people dies by accidents, illness or natural causes.
However is interesting to notice some peaks at around 5, 25 and 50 years of age: they might correspond to generations with higher natallity rates or less infant mortality.
What about the education attainment distribution?
Let’s change the x scale so we can see more details on the distribution if the household’s income.
decil <- quantile(houseinc$income.hh.av.rent, probs = seq(0, 1, 0.1), na.rm = T)
decil[1] <- 0
decil[11] <- decil[11] + 100
households$income.dec <- cut(households$income.hh.av.rent, decil, right = F)
houseinc$income.dec <- cut(houseinc$income.hh.av.rent, decil, right = F)
levels(houseinc$income.dec) <- c(
"US$3 to US$419", "US$420 to US$575", "US$576 to US$738", "US$739 to US$907",
"US$908 to US$1109", "US$1110 to US$1369", "US$1370 to US$1709",
"US$1710 to US$2309", "US$2310 to US$3609", "US$3610 and Up" )
levels(households$income.dec) <- levels(houseinc$income.dec)
indec <- subset(houseinc, !is.na(income.dec)) %>%
group_by(income.dec) %>%
summarise(Total = sum(income.hh.av.rent, na.rm = T))
So, while the mean household income is US$1707 the median is at US$1110. How it does compare with other countries?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.96 653.40 1110.00 1707.00 1976.00 53280.00
summary(inhab$num.inhabitants)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.389 4.000 15.000
ord <- as.character(
arrange(data.frame(table(households$edependence)), desc(Freq))$Var1)
ggplot(aes(factor(edependence, levels = ord)), data = households) +
geom_histogram(fill = "#67a9cf") +
theme(axis.text.x = element_text(angle = 40, hjust = 1)) +
xlab('Educational Institution Type (private/public)') +
ylab('Frequency')
ggplot(aes(factor(edependence, levels = ord),
y = 100 * ..count.. / sum(..count..)),
data = households[households$edependence %in% ord[c(3,2,5)], ]) +
geom_histogram(fill = "#67a9cf") +
theme(axis.text.x = element_text(angle = 40, hjust = 1)) +
xlab('Primary Education Type (private/public)') +
ylab('Percentage')
summary(subset(households, !is.na(health.exp))$health.exp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.402 27.240 44.360 80.980 91.500 2279.000
ggplot(aes(health.exp), data = subset(households, !is.na(health.exp))) +
geom_histogram(fill = "#67a9cf", binwidth = 10) +
xlab('Health Expenditure (US$)') +
ylab('Frequency')
ggplot(aes(health.exp), data = subset(households, !is.na(health.exp))) +
geom_histogram(fill = "#67a9cf", binwidth = 0.2) +
scale_x_continuous(trans = "log1p", breaks = c(5,50,100,200,1000,2000)) +
xlab('Health Expenditure (US$)') +
ylab('Frequency')
We can arrange a little bit more this plot and create a population pyramid:
The peaks seem to change for each gender! We can also notice that there are more women (53.2%) than men (46.8%). Are the gender’s average age different?
We see a difference in the average age for both gender, with males having an overall younger population.
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| Men | 1 | 16 | 30 | 33.49 | 50 | 101 |
| Women | 1 | 18 | 34 | 36.15 | 52 | 103 |
Let’s test if the difference in statistical significant by using the Wilcoxon Rank Test:
| Test statistic | P value | Alternative hypothesis |
|---|---|---|
| 143282477 | 2.947e-28 * * * | two.sided |
What about the education attainment distribution? We will study the distribution using an stacked histogram, so we can study at the same time if there are any significant differences between genders.
We see a pick in the distribution at category 5, which correspond to the primary education. This doesn’t mean that most of the population only reach primary school: we have not removed the school-age population. We could use two variables to subset the data and include only the population that is no longer studying:
edu.finish: indicates wheter the person did or not finish the educational attainment reported.age: we subset only the population over 30 years old, assuming that most of the people is no longer studying at that age.As shown in the Introduction
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Warning: Removed 45 rows containing non-finite values (stat_boxplot).
## Warning: Removed 45 rows containing missing values (geom_point).
What are households expending on?
tabex <- expenses %>% group_by(description) %>% summarise(expend = sum(expense))
usar <- as.character(
as.data.frame(tabex[order(-tabex$expend)[2:33][-17][-14], ])[, 1])
highexp <- subset(expenses, description %in% usar)
tabexp <- expenses %>% group_by(description) %>% summarise(co = n())
usar1 <- as.character(
as.data.frame(tabexp[order(-tabexp$co)[1:31][-5], ])[, 1])
moreexp <- subset(expenses, description %in% usar1)
tabexp2 <- highexp %>% group_by(description) %>% summarise(med = median(expense))
usar2 <- as.character(as.data.frame(tabexp2[order(-tabexp2$med), ])[, 1])
ggplot(aes(factor(description, levels = usar2), expense), data = highexp) +
geom_jitter(alpha = 0.3, color = "#67a9cf") +
geom_boxplot(alpha = 0.8) +
scale_y_log10() +
scale_x_discrete(labels = 1:30) +
xlab('Expenditure Category') +
ylab('Expenditure Amount (US$)')
tabexp3 <- moreexp %>% group_by(description) %>% summarise(med = median(expense))
usar3 <- as.character(as.data.frame(tabexp3[order(-tabexp3$med), ])[, 1])
ggplot(aes(factor(description, levels = usar3), expense), data = moreexp) +
geom_jitter(alpha = 0.3, color = "#67a9cf") +
geom_boxplot(alpha = 0.8) +
scale_y_log10(limits = c(0.1, 10000)) +
scale_x_discrete(labels = 1:30) +
xlab('Expenditure Category') +
ylab('Expenditure Amount (US$)')
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3 rows containing missing values (geom_point).
households$age.group <- cut(households$age, seq(15, 81, 5), right = F)
ggplot(aes(age.group),
data = subset(households, !is.na(dep.work.income) & !is.na(age.group))) +
geom_boxplot(aes(y = dep.work.income, fill = gender)) +
scale_y_continuous(trans = 'log1p') +
scale_fill_brewer(palette = "Paired")
t1 <- dcast(subset(households, !is.na(dep.work.income)),
age.group ~ gender,
fun = median,
value.var = 'dep.work.income')
w <- households %>%
subset(!is.na(dep.work.income)) %>%
group_by(age.group, gender) %>%
summarise(inco = median(dep.work.income))
ggplot(aes(age.group), data = subset(w, !is.na(age.group))) +
geom_histogram(aes(y = inco, fill = gender), stat = 'identity',
alpha = 0.5, position = "identity", color = 'grey') +
scale_fill_brewer(palette = "Paired")
t <-
left_join(
expenses,
subset(households, person.id == 1,
select = c(num.inhabitants, income.dec, home.id)),
by = 'home.id') %>%
subset(!is.na(income.dec))
expsum <-
t %>%
group_by(income.dec, d) %>%
summarise(total = sum(expense), median = median(expense), quant = n())
ggplot(aes(d, y = total), data = expsum) +
geom_bar(aes(fill = d), stat = 'identity') +
scale_x_discrete(labels = '') +
scale_fill_brewer(palette = "Paired") +
facet_grid( ~ income.dec)
ggplot(aes(income.dec, y = median), data = expsum) +
geom_bar(aes(fill = income.dec), stat = 'identity') +
scale_x_discrete(labels = '') +
scale_fill_brewer(palette = "Paired") +
facet_grid( ~ d)
expsum <-
expsum %>%
group_by(income.dec) %>%
mutate(suma = sum(total))
ggplot(aes(income.dec, y = 100 * total / suma), data = expsum) +
geom_bar(aes(fill = income.dec), stat = 'identity') +
scale_x_discrete(labels = '') +
scale_fill_brewer(palette = "Paired") +
facet_grid( ~ d)
| D Code | Description |
|---|---|
| 01 | Food and non-alcoholic beverages |
| 02 | Alchoholic beverages, tobacco and narcotics |
| 03 | Clothing and footwear |
| 04 | Housing, water, electricity, gas and other fuels |
| 05 | Furnishings, household equipment and routine household maintenance |
| 06 | Health |
| 07 | Transport |
| 08 | Communication |
| 09 | Recreation and culture |
| 10 | Education |
| 11 | Restaurants and hotels |
| 12 | Miscellaneous goods and services |